2024-12-24 de novo prelim

Introduction

Three days ago I was looking through the GTDB-Tk manual and saw that de_novo_wf was an option for building the trees. From the description given:

knitr::include_url("https://ecogenomics.github.io/GTDBTk/commands/de_novo_wf.html")

I believed this would be worth running, as it might produce more accurate trees. Sample 1Dt2d (Enterobacter cancerogenus) had been placed in the genus Pantoea by classify_wf in the previous GTDB-Tk analysis, which led me to this search. After a bit of trial and error, I produced a working script.

Methods

This ran as a Slurm job on Hawk (SCW) from roughly 20:10 on the 23rd to 01:00 on the 24th, a total of about 4 hours and 50 minutes. The main parameters that I experimented with were:

- #SBATCH --ntasks=5
- #SBATCH --time=24:00:00
- #SBATCH --mem=50g
- --cpus 10

I settled on these as being the “best”; however, it is entirely possible that they could be optimised further.
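
A minimal sketch of how these settings could fit together in a job script is shown here. It is an illustration, not the script used for this run: the genome directory, output directory, file extension and outgroup taxon are all placeholder assumptions (the outgroup is simply the example from the GTDB-Tk documentation), and only the #SBATCH values and --cpus 10 come from the list above.

```shell
#!/bin/bash
#SBATCH --ntasks=5
#SBATCH --time=24:00:00
#SBATCH --mem=50g

# Placeholder values (assumptions, not taken from the original run).
GENOME_DIR="${GENOME_DIR:-genomes}"          # directory holding the assemblies
OUT_DIR="${OUT_DIR:-gtdb_tk_de_novo}"        # where GTDB-Tk writes its results
EXTENSION="${EXTENSION:-fasta}"              # file extension of the assemblies
OUTGROUP="${OUTGROUP:-p__Patescibacteria}"   # example outgroup from the GTDB-Tk docs
CPUS=10                                      # matches the --cpus value listed above

if command -v gtdbtk >/dev/null 2>&1; then
    # de_novo_wf runs identify, align, infer, root and decorate in one go.
    gtdbtk de_novo_wf \
        --genome_dir "$GENOME_DIR" \
        --extension "$EXTENSION" \
        --bacteria \
        --outgroup_taxon "$OUTGROUP" \
        --out_dir "$OUT_DIR" \
        --cpus "$CPUS"
else
    # GTDB-Tk usually needs its conda environment or module loaded first.
    echo "gtdbtk not found on PATH; load its environment before submitting" >&2
fi
```

One thing that may be worth testing when optimising: --ntasks=5 reserves five Slurm tasks, while --cpus 10 asks GTDB-Tk for ten threads in a single process. Depending on how the cluster allocates cores, requesting them with --cpus-per-task might match --cpus more closely.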

Results

This analysis produced these files:

/scratch/scw2160/02_outputs/flye_asm/gtdb_tk_de_novo5/
.:
text.txt
ls
touch
list.txt
align
gtdbtk.bac120.decorated.tree
gtdbtk.bac120.decorated.tree-table
gtdbtk.log
identify
infer
gtdbtk.warnings.log

./align:
gtdbtk.bac120.msa.fasta.gz
gtdbtk.bac120.user_msa.fasta.gz
gtdbtk.bac120.filtered.tsv

./identify:
gtdbtk.ar53.markers_summary.tsv
gtdbtk.bac120.markers_summary.tsv
gtdbtk.translation_table_summary.tsv
gtdbtk.failed_genomes.tsv

./infer:
gtdbtk.bac120.decorated.tree
gtdbtk.bac120.decorated.tree-taxonomy
gtdbtk.bac120.decorated.tree-table
intermediate_results

./infer/intermediate_results:
gtdbtk.bac120.rooted.tree
gtdbtk.bac120.fasttree.log
gtdbtk.bac120.tree.log
gtdbtk.bac120.unrooted.tree

I then opened the gtdbtk.bac120.decorated.tree file in Dendroscope for review. All 10 samples are on one tree, but 1Dt2d is still being placed in the “wrong” genus. I therefore reviewed its sister accession on the NCBI database.
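
The check that every sample made it into the tree can also be done from the command line before opening Dendroscope. The sketch below is one possible way, assuming the sample IDs appear verbatim as tip labels in the Newick file; check_tips is a hypothetical helper name, and it uses a plain substring match, so an ID that is a prefix of another label would also count as found.

```shell
#!/bin/bash
# check_tips TREE_FILE ID [ID ...]
# Reports whether each ID occurs in the tree file.
# Returns the number of missing IDs (0 means all were found).
check_tips() {
    tree_file=$1
    shift
    missing=0
    for id in "$@"; do
        # -F: treat the ID as a fixed string, -q: only set the exit status
        if grep -qF -- "$id" "$tree_file"; then
            echo "found   $id"
        else
            echo "missing $id"
            missing=$((missing + 1))
        fi
    done
    return "$missing"
}
```

It could be called as check_tips gtdbtk.bac120.decorated.tree 1Dt2d followed by the other nine sample IDs.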

Conclusion

The NCBI page for the sister accession includes a CheckM analysis that reports:

completeness: 90%
contamination: 3.6%
Taxonomy check status: failed

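In Dendroscope, the node joining 1Dt2d and its sister accession has the label 0.968. I believe this is the support value for that node (GTDB-Tk infers the tree with FastTree, which reports local support values between 0 and 1), so the grouping is well supported rather than being a literal probability that the relationship is correct. This suggests the two are very closely related, possibly the same species, and the online sample is also identified as Enterobacter cancerogenus. However, given the CheckM result and the failed taxonomy check, I find it plausible that both have been misidentified and are in reality Pantoea species; I find this the most parsimonious explanation. I will follow this up with a CheckM analysis of my own on 1Dt2d.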

dendroscope screenshot showing location and relationship for 1Dt2d after de novo analysis

This was a “technical spike” or proof of concept for de_novo_wf

📌 TODO: run a CheckM analysis on 1Dt2d to see if the values are similar to those of the online sample